LocVTP: Video-Text Pre-training for Temporal Localization

Abstract

Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potential on localization-based tasks, e.g., temporal grounding, is under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained contrastive alignment as a complement to the coarse-grained one by means of a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate that our LocVTP achieves state-of-the-art performance on both retrieval-based and localization-based tasks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimum model designs and training strategies. Code is available at https://github.com/mengcaopku/LocVTP.

Similar resources

Video text extraction using temporal feature vectors

A new caption text extraction algorithm that takes full advantage of the temporal information in a video sequence is developed. By detecting the (dis)appearance of caption text in a video stream, we first identify the video segment that contains the same caption text. Then, using the gray-level vector traced across the segment as the feature vector for a pixel point, we can clearly separate a captio...

Video text localization using wavelet and shearlet transforms

Text in video is useful and important in indexing and retrieving the video documents efficiently and accurately. In this paper, we present a new method of text detection using a combined dictionary consisting of wavelets and a recently introduced transform called shearlets. Wavelets provide optimally sparse expansion for point-like structures and shearlets provide optimally sparse expansions fo...

Text detection, localization, and tracking in compressed video

Video text information plays an important role in semantic-based video analysis, indexing and retrieval. Video texts are closely related to the content of a video. Usually, the fundamental steps of text-based video analysis, browsing and retrieval consist of video text detection, localization, tracking, segmentation and recognition. Video sequences are commonly stored in compressed formats wher...

Video Text Localization with an emphasis on Edge Features

Text detection and localization play a major role in video analysis and understanding. The scene text embedded in video consists of high-level semantics and hence contributes significantly to visual content analysis and retrieval. This paper proposes a novel method to robustly localize text in natural scene images and videos based on a Sobel edge-emphasizing approach. The input image is ...

An Automatic Video Text Detection, Localization and Extraction Approach

Text in video is a very compact and accurate clue for video indexing and summarization. This paper presents an algorithm that treats a word group as a special symbol to automatically detect, localize and extract video text using a support vector machine (SVM). First, four Sobel operators are applied to get the edge map (EM) of the video frame, and the EM is segmented into N×2N blocks. Then charact...

Journal

Journal title: Lecture Notes in Computer Science

Year: 2022

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-031-19809-0_3